ZhuHuangTianChi-hw2

GNBF5030 Homework 2

Student id: 1155228903

Question 1

First, download orf_trans.fasta (S. cerevisiae S288C) from yeastgenome.org using wget.


xxxxxxxxxx
$ mkdir -p ~/5030/hw2/q1
$ cd ~/5030/hw2/q1
$ wget http://sgd-archive.yeastgenome.org/sequence/S288C_reference/orf_protein/orf_trans.fasta.gz

--2024-10-23 16:24:28--  http://sgd-archive.yeastgenome.org/sequence/S288C_reference/orf_protein/orf_trans.fasta.gz
Resolving sgd-archive.yeastgenome.org (sgd-archive.yeastgenome.org)... 52.218.132.178, 52.218.242.114, 52.218.237.242, ...
Connecting to sgd-archive.yeastgenome.org (sgd-archive.yeastgenome.org)|52.218.132.178|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2603046 (2.5M) [application/x-gzip]
Saving to: ‘orf_trans.fasta.gz’

orf_trans.fasta.gz                  100%[=================================================================>]   2.48M   304KB/s    in 7.4s    

2024-10-23 16:24:36 (345 KB/s) - ‘orf_trans.fasta.gz’ saved [2603046/2603046]

$ gunzip orf_trans.fasta.gz

Then create a database of these sequences.


xxxxxxxxxx
$ makeblastdb -in orf_trans.fasta -dbtype prot -out yeast_protein_db

Building a new DB, current time: 10/23/2024 16:24:49
New DB name:   /hdd1/home/f24_htczhu/5030/hw2/q1/yeast_protein_db
New DB title:  orf_trans.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 6039 sequences in 0.16837 seconds.

In order to find sequences that are similar to others, you need to blastp this queries against itself (using the database created in step2).


xxxxxxxxxx
$ blastp -query orf_trans.fasta -db yeast_protein_db -out blastp_output.txt -max_target_seqs 5 -evalue 1e-5 -outfmt 6

At the end, summarize the blast output and filter the proteins that are similar to the other in the yeast exome. Use wc command to show the number of proteins in your output file.


xxxxxxxxxx
$ wc -l blastp_output.txt

14880 blastp_output.txt

Question 2

Prepare the working directory and .sh files:


xxxxxxxxxx
File structure:
        |--- data 
        |      |--- trimmed_fastq_small
        |      |--- ref_genome
        |
        |--- results
               |--- sam
               |--- bam
               |--- bcf
               |--- vcf


xxxxxxxxxx
# Inital prepare
$ mkdir -p ~/5030/hw2/q2
$ cd ~/5030/hw2/q2
$ [ -f run_mapping.sh ] || touch run_mapping.sh
$ [ -f run_variant_calling.sh ] || touch run_variant_calling.sh
$ chmod +x run_mapping.sh
$ chmod +x run_variant_calling.sh

# Prepare testing file: ecoli_rel606.fasta ...
$ mkdir -p data/trimmed_fastq_small
$ cd data/trimmed_fastq_small
$ for file in /hdd1/shared/5030-lec6/data/trimmed_fastq_small/*; do ln -s $file; done
$ mkdir -p ~/5030/hw2/q2/data/ref_genome 
$ cd ~/5030/hw2/q2/data/ref_genome
$ ln -s /hdd1/shared/5030-lec6/data/ref_genome/ecoli_rel606.fasta ecoli_rel606.fasta
$ cd ~/5030/hw2/q2
$ mkdir -p results/sam results/bam results/bcf results/vcf

Content of the file run_mapping.sh (download):


#!/bin/bash

# Inputs: 
# paired_end_1 - the path to the first paired-end read file
# paired_end_2 - the path to the second paired-end read file
# reference_genome - the path to the reference genome file
# output_bam - the prefix path for the output bam files

paired_end_1=$1
paired_end_2=$2
reference_genome=$3
output_bam=$4  # only <prefix> !!!

# Index the reference genome for use by BWA
bwa index $reference_genome

# Align the reads to the reference genome
bwa mem $reference_genome $paired_end_1 $paired_end_2 > ${output_bam}.aligned.sam

# Convert the SAM file to BAM format
samtools view -S -b ${output_bam}.aligned.sam > ${output_bam}.aligned.bam

# Sort and index the BAM file
samtools sort -o ${output_bam}.aligned.sorted.bam ${output_bam}.aligned.bam
samtools index ${output_bam}.aligned.sorted.bam

echo "Mapping is complete."

Content of the file run_variant_calling.sh (download):


#!/bin/bash

# Inputs:
# sorted_bam_file - the path to the sorted and indexed BAM file
# reference_genome - the path to the reference genome file
# output_vcf - the prefix path for the output VCF file

sorted_bam_file=$1
reference_genome=$2
output_vcf=$3  # only <prefix> !!!

# Calculate the coverage using mpileup
bcftools mpileup -O b -o ${output_vcf}_raw.bcf -f $reference_genome $sorted_bam_file

# Call variants using bcftools
bcftools call --ploidy 1 -m -v -o ${output_vcf}_variants.vcf ${output_vcf}_raw.bcf

# Filter and report the SNPs
vcfutils.pl varFilter ${output_vcf}_variants.vcf > ${output_vcf}_final_variants.vcf

# Optionally, visualize the aligned reads (use space key to move around)
samtools tview $sorted_bam_file $reference_genome

echo "Variant calling is complete."

Run these scripts (example):


$ bash run_mapping.sh \
  data/trimmed_fastq_small/SRR2584866_1.trim.sub.fastq \
  data/trimmed_fastq_small/SRR2584866_2.trim.sub.fastq \
  data/ref_genome/ecoli_rel606.fasta \
  results/bam/SRR2584866
  
[bwa_index] Pack FASTA... 0.04 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 1.04 seconds elapse.
[bwa_index] Update BWT... 0.02 sec
[bwa_index] Pack forward-only FASTA... 0.02 sec
[bwa_index] Construct SA from BWT and Occ... 0.33 sec
[main] Version: 0.7.18-r1243-dirty
[main] CMD: bwa index data/ref_genome/ecoli_rel606.fasta
[main] Real time: 1.441 sec; CPU: 1.439 sec
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 77446 sequences (10000033 bp)...
[M::process] read 77296 sequences (10000182 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (48, 36728, 21, 61)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (420, 660, 1774)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 4482)
[M::mem_pestat] mean and std.dev: (784.68, 700.87)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 5836)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (221, 361, 576)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1286)
[M::mem_pestat] mean and std.dev: (412.89, 227.06)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1641)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (560, 2011, 2594)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 6662)
[M::mem_pestat] mean and std.dev: (1580.30, 978.54)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 8696)
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (320, 549, 942)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 2186)
[M::mem_pestat] mean and std.dev: (581.31, 431.43)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 2808)
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RF
[M::mem_pestat] skip orientation RR
[M::mem_process_seqs] Processed 77446 reads in 2.649 CPU sec, 2.598 real sec
[M::process] read 75458 sequences (10000048 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (65, 36742, 10, 60)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (172, 452, 1166)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 3154)
[M::mem_pestat] mean and std.dev: (645.66, 709.63)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 4148)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (224, 368, 582)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1298)
[M::mem_pestat] mean and std.dev: (417.93, 229.24)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1656)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (2317, 2353, 2422)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (2107, 2632)
[M::mem_pestat] mean and std.dev: (2319.12, 75.27)
[M::mem_pestat] low and high boundaries for proper pairs: (2002, 2737)
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (244, 513, 1366)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 3610)
[M::mem_pestat] mean and std.dev: (781.13, 858.61)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 4732)
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RF
[M::mem_pestat] skip orientation RR
[M::mem_process_seqs] Processed 77296 reads in 2.362 CPU sec, 2.268 real sec
[M::process] read 74054 sequences (10000155 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (58, 35818, 13, 79)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (288, 553, 1479)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 3861)
[M::mem_pestat] mean and std.dev: (822.66, 856.80)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 5052)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (224, 361, 576)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1280)
[M::mem_pestat] mean and std.dev: (414.99, 227.46)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1632)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (1176, 2230, 2643)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 5577)
[M::mem_pestat] mean and std.dev: (1799.00, 921.05)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 7044)
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (365, 601, 1880)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 4910)
[M::mem_pestat] mean and std.dev: (1144.51, 1264.10)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 6425)
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RF
[M::mem_pestat] skip orientation RR
[M::mem_process_seqs] Processed 75458 reads in 2.356 CPU sec, 2.282 real sec
[M::process] read 45746 sequences (6219545 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (41, 35149, 14, 68)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (321, 488, 1274)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 3180)
[M::mem_pestat] mean and std.dev: (607.09, 527.87)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 4133)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (220, 353, 566)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1258)
[M::mem_pestat] mean and std.dev: (407.78, 223.19)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1604)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (652, 2237, 2446)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 6034)
[M::mem_pestat] mean and std.dev: (1704.23, 991.13)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 7828)
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (231, 409, 1018)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 2592)
[M::mem_pestat] mean and std.dev: (513.68, 468.59)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 3379)
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RF
[M::mem_pestat] skip orientation RR
[M::mem_process_seqs] Processed 74054 reads in 2.340 CPU sec, 2.266 real sec
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (40, 21701, 9, 32)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (235, 720, 1382)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 3676)
[M::mem_pestat] mean and std.dev: (770.19, 676.46)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 4823)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (219, 350, 560)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1242)
[M::mem_pestat] mean and std.dev: (404.41, 221.74)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1583)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (325, 557, 721)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1513)
[M::mem_pestat] mean and std.dev: (501.66, 309.62)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1909)
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RR
[M::mem_process_seqs] Processed 45746 reads in 1.457 CPU sec, 1.421 real sec
[main] Version: 0.7.18-r1243-dirty
[main] CMD: bwa mem data/ref_genome/ecoli_rel606.fasta data/trimmed_fastq_small/SRR2584866_1.trim.sub.fastq data/trimmed_fastq_small/SRR2584866_2.trim.sub.fastq
[main] Real time: 10.964 sec; CPU: 11.308 sec
Mapping is complete.

$ bash run_variant_calling.sh \
  results/bam/SRR2584866.aligned.sorted.bam \
  data/ref_genome/ecoli_rel606.fasta \
  results/vcf/SRR2584866
  
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
Variant calling is complete.